This talk is a notebook, and it's available on GitHub: maigimenez/ep2016_vect4word
Schler, J., Koppel, M., Argamon, S., & Pennebaker, J. W. (2006, March). Effects of Age and Gender on Blogging. In AAAI Spring Symposium: Computational Approaches to Analyzing Weblogs (Vol. 6, pp. 199–205). Read the paper.
| Feature | Male | Female |
|---|---|---|
| linux | 0.53 ± 0.04 | 0.03 ± 0.01 |
| india | 0.62 ± 0.04 | 0.15 ± 0.01 |
| programming | 0.36 ± 0.02 | 0.08 ± 0.01 |
| | 0.90 ± 0.04 | 0.19 ± 0.02 |
| software | 0.99 ± 0.05 | 0.17 ± 0.02 |
| shopping | 0.66 ± 0.02 | 1.48 ± 0.03 |
| mom | 2.07 ± 0.05 | 4.69 ± 0.08 |
| cried | 0.31 ± 0.01 | 0.72 ± 0.02 |
| freaked | 0.08 ± 0.01 | 0.21 ± 0.01 |
| cute | 0.83 ± 0.03 | 2.32 ± 0.04 |

Word frequency (per 10,000 words) and standard error by gender (Schler et al., 2006)
In [1]:
from configparser import ConfigParser
from os.path import join
from os import pardir
In [2]:
config = ConfigParser()
config.read(join(pardir,'src','credentials.ini'))
APP_KEY = config['twitter']['app_key']
APP_SECRET = config['twitter']['app_secret']
OAUTH_TOKEN = config['twitter']['oauth_token']
OAUTH_TOKEN_SECRET = config['twitter']['oauth_token_secret']
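For reference, this is roughly what src/credentials.ini is expected to look like (the section and key names come from the code above; the values are placeholders for your own Twitter API credentials):

[twitter]
app_key = YOUR_APP_KEY
app_secret = YOUR_APP_SECRET
oauth_token = YOUR_OAUTH_TOKEN
oauth_token_secret = YOUR_OAUTH_TOKEN_SECRET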
In [3]:
from twitter import oauth, Twitter, TwitterHTTPError
In [ ]:
auth = oauth.OAuth(OAUTH_TOKEN, OAUTH_TOKEN_SECRET,
                   APP_KEY, APP_SECRET)
twitter_api = Twitter(auth=auth)
twitter_api.retry = True
Full disclaimer: gender is not a binary issue. This is just a simplified example. If you'd like to expand this experiment, go ahead and contact me!
In [ ]:
brogrammers = ['jakevdp', 'rasbt', 'GaelVaroquaux', 'amuellerml', 'fperez_org',
               'fpedregosa', 'ogrisel', 'dontusethiscode', 'randal_olson', 'tdhopper']
sisgrammers = ['pkafei', 'LorenaABarba', 'jessicamckellar', 'heddle317', 'diana_clarke',
               'wholemilk', 'spang', 'cecilycarver', 'juliaelman', 'b0rk']

brotweets = []
for bro in brogrammers:
    brotweets.extend(twitter_api.statuses.user_timeline(screen_name=bro, count=100))

sistweets = []
for sis in sisgrammers:
    sistweets.extend(twitter_api.statuses.user_timeline(screen_name=sis, count=100))
In [ ]:
import re
def clean_tweet(tweet):
    """ Simplest preprocessing.
    Convert a tweet to lowercase and replace URLs and @usernames by a generic token.
    Args:
        tweet (str): Tweet to clean.
    Returns:
        str: Preprocessed tweet
    """
    tweet = tweet.lower()
    # Replace URLs with a token
    URL_REGEX = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'
    tweet = re.sub(URL_REGEX, '<url>', tweet, flags=re.MULTILINE)
    # Replace usernames with a token
    tweet = re.sub("@([A-Za-z0-9_]+)", "<user>", tweet)
    # Collapse repeated spaces
    tweet = re.sub(r"\s{2,}", " ", tweet)
    # If a character is repeated more than 4 times, keep only 3 repetitions
    tweet = re.sub(r'(.)\1{4,}', r'\1\1\1', tweet)
    return tweet
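As a quick sanity check, here is a small usage sketch of clean_tweet on a made-up tweet (the input string is invented for illustration):

example = "I loooooove Python!!!!! thanks @pyladies https://europython.eu"
print(clean_tweet(example))
# roughly: 'i looove python!!! thanks <user> <url>'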
In [ ]:
import pandas as pd

dataset = []
# Gather the text
for tweet in brotweets:
    cleaned_tweet = clean_tweet(tweet['text'])
    dataset.append({'id': tweet['id'], 'text': cleaned_tweet, 'class': 0})
for tweet in sistweets:
    cleaned_tweet = clean_tweet(tweet['text'])
    dataset.append({'id': tweet['id'], 'text': cleaned_tweet, 'class': 1})
pd_dataset = pd.DataFrame(dataset)
In [ ]:
pd_dataset.head()
In [ ]:
pd_dataset.to_csv('../corpora/full_dataset.csv')
In [ ]:
pd_dataset[['class', 'id']].to_csv('../corpora/ep16.csv')
In [4]:
import pandas as pd
DATASET_PATH = "../corpora/full_dataset.csv"
pd_dataset = pd.read_csv(DATASET_PATH, index_col=0)
pd_dataset.head()
Out[4]:
In [5]:
import nltk.data
#nltk.download()
In [6]:
from nltk.tokenize import TweetTokenizer, word_tokenize
from nltk.corpus import stopwords
from collections import Counter
import re
import scipy.stats as stats
In [7]:
print(', '.join(stopwords.words('english')[:20]))
In [8]:
def get_vocabulary(corpus, tokenizer):
    """ Get the vocabulary of a dataset.
    Get the vocabulary of a set of tweets after removing stopwords and non-letters,
    and replacing each number by the token <number>.
    Args:
        corpus (list of tweets): A list of tweets.
        tokenizer (function): Tokenizer function, used to get the tokens of each tweet.
    Returns:
        Counter: Vocabulary with the frequency of each word in it.
    """
    stop_words = stopwords.words('english')
    # Remove punctuation marks
    no_punks = [re.sub(r'\W', ' ', tweet) for tweet in corpus]
    # Tokenize and remove stop words
    clean_tokens = []
    for tweet in no_punks:
        # Replace numbers with a token
        tweet = re.sub(r"\.\d+\s*", ".<number> ", tweet)
        tweet = re.sub(r"\d+\s*", " <number> ", tweet)
        tokens = tokenizer(tweet)
        tokens = [token for token in tokens if token not in stop_words]
        clean_tokens.extend(tokens)
    # Build the vocabulary
    return Counter(clean_tokens)
In [9]:
tknzr = TweetTokenizer()
brotweets = pd_dataset[pd_dataset['class'] == 0]['text'].tolist()
sistweets = pd_dataset[pd_dataset['class'] == 1]['text'].tolist()
brocabulary = get_vocabulary(brotweets, tknzr.tokenize)
siscabulary = get_vocabulary(sistweets, tknzr.tokenize)
In [10]:
brocabulary.most_common(10)
Out[10]:
In [11]:
siscabulary.most_common(10)
Out[11]:
In [12]:
from bokeh.plotting import figure, show, ColumnDataSource
from bokeh.io import output_notebook
from bokeh.models import HoverTool
output_notebook()
In [13]:
MOST_COMMON = 50
mc_brocavulary = brocabulary.most_common(int(MOST_COMMON / 2))
mc_siscavulary = siscabulary.most_common(int(MOST_COMMON / 2))
fr_brocavulary, fr_siscavulary = [], []
most_common_words = mc_brocavulary + mc_siscavulary
words = list(set(word for word, _ in most_common_words))
for word in words:
    if word in brocabulary:
        fr_brocavulary.append(brocabulary[word])
    else:
        fr_brocavulary.append(0)
    if word in siscabulary:
        fr_siscavulary.append(siscabulary[word])
    else:
        fr_siscavulary.append(0)
In [14]:
import numpy as np

range_words = list(range(1, len(words) + 1))
source = ColumnDataSource(data=dict(range_words=range_words,
                                    words=words,
                                    freq_true=fr_brocavulary,
                                    freq_false=fr_siscavulary))
# A single HoverTool that shows the word under the cursor
hover = HoverTool(tooltips=[("words", "@words")])
hover.point_policy = "follow_mouse"
TOOLS = "pan,wheel_zoom,box_zoom,reset,save"
p = figure(title="Vocabulary gender", x_range=words, tools=[TOOLS, hover])
p.xaxis.axis_label = 'Words'
p.yaxis.axis_label = 'Frequency'
p.circle('range_words', 'freq_true', source=source, fill_alpha=0.2, size=10, color="navy")
p.circle('range_words', 'freq_false', source=source, fill_alpha=0.2, size=10, color='red')
p.xaxis.major_label_orientation = np.pi / 4
show(p)
Out[14]:
In [15]:
tweet_lens_bro = [len(tweet) for tweet in brotweets]
hist_bro, edges_bro = np.histogram(tweet_lens_bro, density=True, bins=20)
tweet_lens_bro.sort()
tweet_lens_sis = [len(tweet) for tweet in sistweets]
hist_sis, edges_sis = np.histogram(tweet_lens_sis, density=True, bins=20)
tweet_lens_sis.sort()
p = figure(title="")
p.quad(top=hist_bro, bottom=0, left=edges_bro[:-1], right=edges_bro[1:],
fill_color="navy", line_color="#033649", fill_alpha=0.3)
p.quad(top=hist_sis, bottom=0, left=edges_sis[:-1], right=edges_sis[1:],
fill_color="red", line_color="#033649", fill_alpha=0.3)
sigma = np.std(tweet_lens_bro)
mu = np.mean(tweet_lens_bro)
pdf = stats.norm.pdf(tweet_lens_bro, mu, sigma)
p.line(tweet_lens_bro, pdf, line_color="navy", line_width=6, alpha=0.7, legend="PDF")
sigma = np.std(tweet_lens_sis)
mu = np.mean(tweet_lens_sis)
pdf = stats.norm.pdf(tweet_lens_sis, mu, sigma)
p.line(tweet_lens_sis, pdf, line_color="red", line_width=6, alpha=0.7, legend="PDF")
p.xaxis.axis_label = 'len(tweets)'
p.yaxis.axis_label = 'density'
show(p)
Out[15]:
In [16]:
from sklearn.model_selection import train_test_split
X = pd_dataset['text'].tolist()
y = pd_dataset['class'].tolist()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=0)
print("Examples in train: {}".format(len(X_train)))
print("Examples in test: {}".format(len(X_test)))
dataset = ["I love Python", "I love NLP", "Pyladies are cool"]
vocabulary = set(["I", "love", "Python", "NLP", "Pyladies", "are", "cool"])
dataset_representation = [[1, 1, 1, 0, 0, 0, 0],
                          [1, 1, 0, 1, 0, 0, 0],
                          [0, 0, 0, 0, 1, 1, 1]]
OK, that looks cool. This looks solved. Let's party!
hotel = [0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0]
motel = [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
w_hotel AND w_motel = 0 — the one-hot vectors share no non-zero component, so they tell us nothing about how similar the two words are.
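To make this concrete, here is a minimal sketch using the toy vectors above, showing that one-hot representations carry no notion of similarity:

import numpy as np

hotel = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0])
motel = np.array([0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0])

# The vectors share no non-zero component, so AND / dot product is 0
print(np.dot(hotel, motel))                  # 0
print(np.logical_and(hotel, motel).any())    # False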
In [17]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(analyzer="word",
                             tokenizer=None,
                             preprocessor=None,
                             stop_words=None,
                             ngram_range=(1, 1),
                             max_features=5000)
# Fit the train
BOW_train = vectorizer.fit_transform(X_train)
BOW_train = BOW_train.toarray()
# Transform the test
BOW_test = vectorizer.transform(X_test)
BOW_test = BOW_test.toarray()
print('Train: {{0|1}}^({}x{})'.format(BOW_train.shape[0], BOW_train.shape[1]))
print('Test: {{0|1}}^({}x{})'.format(BOW_test.shape[0], BOW_test.shape[1]))
vocab = vectorizer.get_feature_names()
print('\nVOCABULARY EXTRACT: {}'.format(', '.join(vocab[500:600])))
np.set_printoptions(threshold=np.inf)
print('\nTWEET REPRESENTATION: {}'.format(BOW_train[0]))
You shall know a word by the company it keeps.
-- J.R. Firth 1957:11
Ok, we need to talk.
Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., & Dean, J. (2013). Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems (pp. 3111-3119).
Now, we can go on.
word2vec implements two algorithms:
- Continuous bag-of-words (CBOW): predict a word from the words around it.
- Skip-gram: predict the surrounding words from the current word.
These algorithms are trained using either:
- Hierarchical softmax, or
- Negative sampling (see the sketch below).
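As a minimal sketch (assuming gensim's Word2Vec API, which we use later in this notebook, and a toy corpus invented for illustration), these choices are exposed through the sg, hs and negative parameters:

from gensim.models import word2vec

toy_corpus = [["i", "love", "python"], ["pyladies", "are", "cool"]]
# sg=0 -> CBOW, sg=1 -> skip-gram
# hs=1 -> hierarchical softmax; hs=0 with negative > 0 -> negative sampling
cbow_negative = word2vec.Word2Vec(toy_corpus, sg=0, hs=0, negative=5, size=100, min_count=1)
skipgram_hsoftmax = word2vec.Word2Vec(toy_corpus, sg=1, hs=1, negative=0, size=100, min_count=1)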
Pennington, J., Socher, R., & Manning, C. D. (2014, October). Glove: Global Vectors for Word Representation. In EMNLP (Vol. 14, pp. 1532-43).
GloVe is an unsupervised learning algorithm for obtaining vector representations for words. Training is performed on aggregated global word-word co-occurrence statistics from a corpus, and the resulting representations showcase interesting linear substructures of the word vector space.
In [18]:
from os.path import join
In [19]:
import numpy as np

def load_glove_dict(glove_filepath):
    """ Build a dictionary with GloVe values.
    Read the GloVe resource and build a dictionary where the key is the word
    and the value is its GloVe representation.
    Args:
        glove_filepath (str): Path where the GloVe resource is.
    Returns:
        dict: Dictionary with GloVe data.
    """
    glove_embeddings = {}
    # TODO: check if the filepath exists
    with open(glove_filepath) as glove_file:
        for line in glove_file:
            split_line = line.split()
            # Store the vector as floats rather than strings
            word, vector = split_line[0], np.asarray(split_line[1:], dtype=np.float32)
            glove_embeddings[word] = vector
    return glove_embeddings
In [20]:
GLOVE_PATH = '../../../resources/GloVe/twitter_dataset'
embedding_size = '25'
glove_file = join(GLOVE_PATH, 'glove.twitter.27B.' + embedding_size + 'd.txt')
glove_25 = load_glove_dict(glove_file)
In [21]:
embedding_size = '100'
glove_file = join(GLOVE_PATH, 'glove.twitter.27B.' + embedding_size + 'd.txt')
glove_100 = load_glove_dict(glove_file)
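As a quick sanity check on the loaded vectors, here is a small sketch that computes cosine similarity between a couple of words (assuming both words appear in the GloVe Twitter vocabulary):

def cosine_similarity(u, v):
    # Cosine similarity between two word vectors
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

print(cosine_similarity(glove_100['python'], glove_100['programming']))
print(cosine_similarity(glove_100['python'], glove_100['shopping']))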
In [23]:
def get_most_common_vocab(most_common, vocabulary):
    """ Get the most common words in a vocabulary.
    Args:
        most_common (int): Number of most common words to retrieve.
        vocabulary (Counter): Vocabulary with words and frequencies of each word.
    Returns:
        set: Set of most common words in this vocabulary.
    """
    most_common_words = vocabulary.most_common(int(most_common))
    return set(word for word, _ in most_common_words)

def get_words_to_plot(most_common, vocabulary, dictionary):
    # Keep the embeddings of the most common words; collect the words without an embedding
    words_to_plot = {}
    unseen_words = []
    for word in get_most_common_vocab(most_common, vocabulary):
        if word in dictionary:
            words_to_plot[word] = dictionary[word]
        else:
            unseen_words.append(word)
    return words_to_plot, unseen_words
In [24]:
from sklearn.manifold import TSNE

def plot_tsne(dictionary, most_common):
    tsne = TSNE(perplexity=30, n_components=2, init='pca', n_iter=5000)
    words_to_plot_bros, unseen_words_bros = get_words_to_plot(most_common, brocabulary, dictionary)
    words_to_plot_sis, unseen_words_sis = get_words_to_plot(most_common, siscabulary, dictionary)
    low_dim_embs_bros = tsne.fit_transform(list(words_to_plot_bros.values()))
    low_dim_embs_sis = tsne.fit_transform(list(words_to_plot_sis.values()))

    words_bros = list(words_to_plot_bros.keys())
    range_words_bros = list(range(1, len(words_bros) + 1))
    source_bros = ColumnDataSource(data=dict(range_words=range_words_bros,
                                             words_bros=words_bros,
                                             x=low_dim_embs_bros[:, 0],
                                             y=low_dim_embs_bros[:, 1]))
    words_sis = list(words_to_plot_sis.keys())
    range_words_sis = list(range(1, len(words_sis) + 1))
    source_sis = ColumnDataSource(data=dict(range_words=range_words_sis,
                                            words_sis=words_sis,
                                            x=low_dim_embs_sis[:, 0],
                                            y=low_dim_embs_sis[:, 1]))

    # A single HoverTool showing the word under the cursor for each group
    hover = HoverTool(tooltips=[("words_bros", "@words_bros"),
                                ("words_sis", "@words_sis")])
    hover.point_policy = "follow_mouse"
    TOOLS = "pan,wheel_zoom,box_zoom,reset,save"
    p = figure(title="Word visualization", tools=[TOOLS, hover])
    p.circle('x', 'y', source=source_bros, fill_alpha=0.2, size=10, color='navy')
    p.circle('x', 'y', source=source_sis, fill_alpha=0.2, size=10, color='red')
    show(p)
    return set(unseen_words_bros + unseen_words_sis)
In [25]:
unseen_words = plot_tsne(glove_100, 1000)
In [26]:
print(', '.join(unseen_words))
In [27]:
def tokenize_dataset(tokenizer, dataset):
    tokenized = []
    for tweet in dataset:
        # Replace numbers with a token
        tweet = re.sub(r"\.\d+\s*", ".<number> ", tweet)
        tweet = re.sub(r"\d+\s*", " <number> ", tweet)
        tokens = tokenizer(tweet)
        tokenized.append(tokens)
    return tokenized
In [28]:
X_train_tokenized = tokenize_dataset(TweetTokenizer().tokenize, X_train)
X_test_tokenized = tokenize_dataset(TweetTokenizer().tokenize, X_test)
In [29]:
def get_embeddings(dataset, dictionary, embedding_size):
    X_embeddings = []
    for tweet in dataset:
        tweet_embeddings = []
        for word in tweet:
            if word in dictionary:
                tweet_embeddings.append(dictionary[word])
        if not tweet_embeddings:
            tweet_embeddings.append(np.zeros(embedding_size))
        # Each tweet has a different number of words and ML techniques require fixed-size inputs,
        # so we average the word embeddings to get one vector per tweet.
        X_embeddings.append(np.mean(np.asarray(tweet_embeddings, dtype=np.float32), axis=0))
    return X_embeddings
In [30]:
X_train_GloVe = get_embeddings(X_train_tokenized, glove_100, 100)
X_test_GloVe = get_embeddings(X_test_tokenized, glove_100, 100)
In [31]:
from gensim.models import word2vec
In [32]:
# Initialize and train the model (this will take some time)
model = word2vec.Word2Vec(X_train_tokenized,
workers = 4,
size = 100,
min_count = 1, # How many times a word should appear to be taken into account
window = 5,
sample = 1e-3 , # Downsample setting for frequent words
batch_words = 100) # Batches of examples passed to worker threads
# This model won't be updated
model.init_sims(replace=True)
model_name = "word2vec"
model.save(model_name)
In [33]:
model.syn0.shape
Out[33]:
In [34]:
_ = plot_tsne(model, 1000)
In [35]:
model.most_similar("python")
Out[35]:
In [36]:
X_train_word2vec = get_embeddings(X_train_tokenized, model, 100)
X_test_word2vec = get_embeddings(X_test_tokenized, model, 100)
In [37]:
from sklearn import svm
from sklearn.metrics import classification_report, roc_curve, auc
In [38]:
clf = svm.SVC()
clf.fit(BOW_train, y_train)
prediction_BOW = clf.predict(BOW_test)
target_names = ['Bros', 'Sis']
print(classification_report(y_test, prediction_BOW, target_names=target_names))
In [39]:
clf = svm.SVC()
clf.fit(X_train_GloVe, y_train)
prediction_GloVe = clf.predict(X_test_GloVe)
print(classification_report(y_test, prediction_GloVe, target_names=target_names))
In [40]:
clf = svm.SVC()
clf.fit(X_train_word2vec, y_train)
prediction_word2vec = clf.predict(X_test_word2vec)
print(classification_report(y_test, prediction_word2vec, target_names=target_names))
In [41]:
false_positive_rate_bow, true_positive_rate_bow, _ = roc_curve(y_test, prediction_BOW)
roc_auc_bow = auc(false_positive_rate_bow, true_positive_rate_bow)
false_positive_rate_glove, true_positive_rate_glove, _ = roc_curve(y_test, prediction_GloVe)
roc_auc_glove = auc(false_positive_rate_glove, true_positive_rate_glove)
false_positive_rate_w2v, true_positive_rate_w2v, _ = roc_curve(y_test, prediction_word2vec)
roc_auc_w2v = auc(false_positive_rate_w2v, true_positive_rate_w2v)
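Note that these ROC curves are built from hard 0/1 predictions, so each curve really has a single operating point. If you want smoother, more informative curves, here is a small sketch using the classifier's continuous scores instead (re-fitting an SVC on the word2vec features, as above):

clf = svm.SVC()
clf.fit(X_train_word2vec, y_train)
scores_w2v = clf.decision_function(X_test_word2vec)
fpr_w2v, tpr_w2v, _ = roc_curve(y_test, scores_w2v)
print(auc(fpr_w2v, tpr_w2v))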
In [42]:
from bokeh.palettes import Spectral6
p = figure(title="Receiver Operating Characteristic", tools=TOOLS)
p.line(false_positive_rate_bow, true_positive_rate_bow, legend='BoW ROC curve (area = {:.2f})'.format(roc_auc_bow),
line_color="green", line_width=2)
p.line(false_positive_rate_glove, true_positive_rate_glove,
legend='GloVe ROC curve (area = {:.2f})'.format(roc_auc_glove),
line_color="blue", line_width=2)
p.line(false_positive_rate_w2v, true_positive_rate_w2v,
legend='W2V ROC curve (area = {:.2f})'.format(roc_auc_w2v),
line_color="yellow", line_width=2)
p.line([0.0, 1.0], [0.0, 1.0], legend='Guessing',
line_color="gray", line_width=2, line_dash=(4, 4))
p.xaxis.axis_label = 'False Positive Rate'
p.yaxis.axis_label = 'True Positive Rate'
p.legend.location = 'bottom_right'
show(p)
Out[42]:
If you have guessed that maybe the mean is not the best option for representing sentences, you may want to use a neural network toolkit: try Layers.
And finally ...